Abstract



Collaborative filtering is a rapidly advancing research area. Every year several new
techniques are proposed and yet it is not clear which of the techniques work best and
under what conditions. In this paper we conduct a study comparing several collaborative
filtering techniques - both classic and recent state-of-the-art - in a variety of
experimental contexts. Specifically, we report conclusions controlling for number of
items, number of users, sparsity level, performance criteria, and computational
complexity. Our conclusions identify what algorithms work well and in what conditions,
and contribute to both industrial deployment of collaborative filtering algorithms and to
the research community.

1 Introduction

Collaborative filtering is a rapidly advancing research area. Classic methods include
neighborhood methods such as memory-based or user-based collaborative filtering. More recent
methods often revolve around matrix factorization, including singular value decomposition
and non-negative matrix factorization. New methods are continually proposed, motivated
by experiments or theory. However, despite the considerable research momentum there is no
consensus or clarity on which method works best.

One difficulty that is undoubtedly central to the lack of clarity is that the performance of
different methods differs substantially based on the problem parameters. Specifically, factors
such as number of users, number of items, and sparsity level (ratio of observed to total
ratings) affect different collaborative filtering methods in different ways. Some methods
perform better in sparse settings while others perform better in dense settings, and so on.

Existing experimental studies in collaborative filtering either do not compare recent
state-of-the-art methods, or do not investigate variations with respect to the above-mentioned
problem parameters. In this paper we do so, and concentrate on comparing both classic and
recent state-of-the-art methods. In our experiments we control for the number of users, the
number of items, and the sparsity level, and we consider multiple evaluation measures as well
as computational cost.

Based on our comparative study we conclude the following.

1. Generally speaking, matrix-factorization-based methods perform best in terms of
prediction accuracy. Nevertheless, in some special cases, several other methods have a
distinct advantage over matrix-factorization methods.

2. The prediction accuracy of the different algorithms depends on the number of users,
the number of items, and density, where the nature and degree of dependence differ
from algorithm to algorithm.

3. There exists a complex relationship between prediction accuracy, its variance, compu- 
tation time, and memory consumption that is crucial for choosing the most appropriate 
recommendation system algorithm. 

The following sections describe in detail the design and implementation of the experi- 
mental study and the experimental results themselves. 

2 Background and Related Work 

Before describing our experimental study, we briefly introduce recommendation systems and 
collaborative filtering techniques. 

2.1 Recommendation Systems 

Broadly speaking, any software system which actively suggests an item to purchase, to
subscribe to, or to invest in can be regarded as a recommender system. In this broad sense,
an advertisement also can be seen as a recommendation. We mainly consider, however, a
narrower definition of "personalized" recommendation systems that base recommendations
on user-specific information.

There are two main approaches to personalized recommendation systems: content-based 
filtering and collaborative filtering. The former makes explicit use of domain knowledge 
concerning users or items. The domain knowledge may correspond to user information such 
as age, gender, occupation, or location or to item information such as genre, producer, 
or length in the case of movie recommendation. 

The latter category of collaborative filtering (CF) does not use user or item information
with the exception of a partially observed rating matrix. The rating matrix holds ratings of
items (columns) by users (rows) and is typically binary, for example like vs. do not like, or
ordinal, for example, one to five stars in Netflix movie recommendation. The rating matrix
may also be gathered implicitly based on user activity, for example a web search followed
by click through may be interpreted as a positive value judgement for the chosen hyperlink
[11]. In general, the rating matrix is extremely sparse, since it is unlikely that each user
experienced and provided ratings for all items.

Hybrid collaborative and content-based filtering strategies combine the two approaches
above, using both the rating matrix, and user and item information. [26l H2l [271 l22l [T5l 
El EH [35] . Such systems typically obtain improved prediction accuracy over content-based 
filtering systems and over collaborative filtering systems. 

In this paper we focus on a comparative study of collaborative filtering algorithms. There
are several reasons for not including content-based filtering. First, a serious comparison of
collaborative filtering systems is a challenging task in itself. Second, experimental results of 
content-based filtering are intimately tied to the domain and are not likely to transfer from 
one domain to another. Collaborative filtering methods, on the other hand, use only the 
rating matrix which is similar in nature across different domains. 

2.2 Collaborative Filtering 

Collaborative filtering systems are usually categorized into two subgroups: memory-based 
and model-based methods. 

Memory-based methods simply memorize the rating matrix and issue recommendations 
based on the relationship between the queried user and item and the rest of the rating matrix. 
Model-based methods fit a parameterized model to the given rating matrix and then issue 
recommendations based on the fitted model. 

The most popular memory-based CF methods are neighborhood-based methods, which 
predict ratings by referring to users whose ratings are similar to the queried user, or to items 
that are similar to the queried item. This is motivated by the assumption that if two users 
have similar ratings on some items they will have similar ratings on the remaining items. Or 
alternatively if two items have similar ratings by a portion of the users, the two items will 
have similar ratings by the remaining users. 

Specifically, user-based CF methods [5] identify users that are similar to the queried user,
and estimate the desired rating to be the average of the ratings of these similar users. Similarly,
item-based CF [31] identifies items that are similar to the queried item and estimates the
desired rating to be the average of the ratings of these similar items. Neighborhood methods
vary considerably in how they compute the weighted average of ratings. Specific examples
of similarity measures that influence the averaging weights include Pearson correlation,
vector cosine, and Mean-Squared-Difference (MSD). Neighborhood-based methods can be
extended with default votes, inverse user frequency, and case amplification [5]. A recent
neighborhood-based method [37] constructs a kernel density estimator for incomplete partial
rankings and predicts the ratings that minimize the posterior loss.
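
To make the neighborhood idea concrete, the following is a minimal sketch of user-based prediction with Pearson correlation weights. It assumes one common variant (the queried user's mean plus a similarity-weighted average of the neighbors' mean-centered ratings); the function names and the toy data are hypothetical and are not taken from the cited methods or from any toolkit.

```python
from math import sqrt

def pearson(ratings_a, ratings_b):
    """Pearson correlation over co-rated items (0.0 when it is undefined)."""
    common = [i for i in ratings_a if i in ratings_b]
    if len(common) < 2:
        return 0.0
    mean_a = sum(ratings_a[i] for i in common) / len(common)
    mean_b = sum(ratings_b[i] for i in common) / len(common)
    cov = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)

def predict_user_based(ratings, user, item):
    """User mean plus the weighted average of neighbors' mean-centered ratings."""
    user_mean = sum(ratings[user].values()) / len(ratings[user])
    num = den = 0.0
    for other, other_ratings in ratings.items():
        if other == user or item not in other_ratings:
            continue
        w = pearson(ratings[user], other_ratings)
        other_mean = sum(other_ratings.values()) / len(other_ratings)
        num += w * (other_ratings[item] - other_mean)
        den += abs(w)
    # Fall back to the user's own mean when no informative neighbor exists.
    return user_mean if den == 0 else user_mean + num / den

# Toy rating matrix (illustrative values only): user -> {item: stars}.
ratings = {
    "alice": {"m1": 5, "m2": 3, "m3": 4},
    "bob":   {"m1": 4, "m2": 2, "m3": 3, "m4": 5},
    "carol": {"m1": 1, "m2": 5, "m4": 2},
}
prediction = predict_user_based(ratings, "alice", "m4")
```

Swapping `pearson` for a cosine or MSD similarity changes only the weights, which is where the neighborhood variants discussed above differ. The sketch does not clip predictions to the 1-5 range.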

Model-based methods, on the other hand, fit a parametric model to the training data that 
can later be used to predict unseen ratings and issue recommendations. Model-based meth- 
ods include cluster-based CF [SSJ (HI [3 [221 [ID] , Bayesian classifiers [231 El] , and regression- 
based methods [39] . The slope-one method [20] fits a linear model to the rating matrix, 
achieving fast computation and reasonable accuracy. 

A recent class of successful CF models are based on low-rank matrix factorization. The 
regularized SVD method |^ factorizes the rating matrix into a product of two low rank 
matrices (user-profile and item-profile) that are used to estimate the missing entries. An
alternative method is Non-negative Matrix Factorization (NMF) [19], which differs in that
it constrains the low rank matrices forming the factorization to have non-negative entries.
Recent variations are Probabilistic Matrix Factorization (PMF) [30], Bayesian PMF [29],
Non-linear PMF [IT], Maximum Margin Matrix Factorization (MMMF) [SI EH E], and 
Nonlinear Principal Component Analysis (NPCA) [H]. 
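
The low-rank idea can be sketched in a few lines. The code below is an illustrative stochastic gradient descent fit of a regularized factorization, in which each observed rating is approximated by the inner product of a user profile and an item profile. It is a sketch under stated assumptions, not the procedure of any cited method: the rank, learning rate, regularization weight, epoch count, and toy data are all hypothetical choices.

```python
import random
from math import sqrt

def factorize(ratings, n_users, n_items, rank=2, lr=0.02, reg=0.02,
              epochs=2000, seed=0):
    """Fit r ~ dot(U[u], V[i]) on observed (u, i, r) triples by SGD with an
    L2 penalty on both factor matrices."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(rank)] for _ in range(n_items)]
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - sum(U[u][k] * V[i][k] for k in range(rank))
            for k in range(rank):
                uk, vk = U[u][k], V[i][k]
                # Gradient step on the squared error plus the L2 penalty.
                U[u][k] += lr * (err * vk - reg * uk)
                V[i][k] += lr * (err * uk - reg * vk)
    return U, V

def train_rmse(ratings, U, V):
    """Root mean squared error of the factorization on the given triples."""
    se = sum((r - sum(a * b for a, b in zip(U[u], V[i]))) ** 2
             for u, i, r in ratings)
    return sqrt(se / len(ratings))

# Six observed entries of a 3 x 3 rating matrix (illustrative values only).
ratings = [(0, 0, 5), (0, 1, 3), (1, 0, 4), (1, 2, 1), (2, 1, 1), (2, 2, 5)]
U, V = factorize(ratings, n_users=3, n_items=3)
```

A missing entry (u, i) is then estimated as the inner product of `U[u]` and `V[i]`. A non-negative variant would additionally keep every entry of `U` and `V` at or above zero.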

2.3 Evaluation Measures 

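The most common CF evaluation measures for prediction accuracy are the mean absolute
error (MAE) and the root mean squared error (RMSE):

\[ \mathrm{MAE} = \frac{1}{n} \sum_{u,i} \left| p_{u,i} - r_{u,i} \right| \]

\[ \mathrm{RMSE} = \sqrt{ \frac{1}{n} \sum_{u,i} \left( p_{u,i} - r_{u,i} \right)^2 } \]

where $p_{u,i}$ and $r_{u,i}$ are the predicted and observed ratings for user u and item i, respectively,
and n is the number of ratings in the sum. The sums above range over a labeled set that is set
aside for evaluation purposes (test set). Other evaluation measures are precision, recall, and
F1 measures.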
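
As a small worked example of the two accuracy measures, the sketch below computes MAE and RMSE over paired lists of predicted and observed test ratings. The numbers are made up for illustration.

```python
from math import sqrt

def mae(predicted, observed):
    """Mean absolute error over paired predictions and observations."""
    return sum(abs(p - r) for p, r in zip(predicted, observed)) / len(observed)

def rmse(predicted, observed):
    """Root of the mean squared error over the same pairs."""
    return sqrt(sum((p - r) ** 2 for p, r in zip(predicted, observed)) / len(observed))

predicted = [3.5, 4.0, 2.0, 5.0]   # illustrative predictions
observed = [4, 4, 1, 3]            # illustrative test ratings
```

Here the absolute errors are 0.5, 0, 1, and 2, so the MAE is 0.875, while the RMSE is the square root of 1.3125. RMSE is never smaller than MAE because squaring weights large errors more heavily.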

Gunawardana and Shani [10] argue that different evaluation metrics lead to different
conclusions concerning the relative performance of the CF algorithms. However, most CF research
papers motivate their algorithm by examining a single evaluation measure. In this paper we 
consider the performance of different CF algorithms as a function of the problem parameters, 
measured using several different evaluation criteria. 

2.4 Related Work 

Several well-written surveys on recommendation systems are available. Adomavicius and
Tuzhilin [1] categorized CF algorithms available as of 2006 into content-based, collaborative,
and hybrid and summarized possible extensions. Su and Khoshgoftaar [SS] concentrated
more on CF methods, including memory-based, model-based, and hybrid methods. This
survey contains most state-of-the-art algorithms available as of 2009, including Netflix prize
competitors. A recent textbook on recommender systems introduces traditional techniques
and explores additional issues like privacy concerns [TB].

There are a couple of experimental studies available. The first study by Breese et al. [5]
compared two popular memory-based methods (Pearson correlation and vector similarity)
and two classical model-based methods (clustering and Bayesian network) on three different
datasets. A more recent experimental comparison of CF algorithms [12] compares user-based
CF, item-based CF, SVD, and several other model-based methods, focusing on e-commerce
applications. It considers precision, recall, F1-measure and rank score as evaluation
measures, with comments about computational complexity. This however ignores some
standard evaluation measures such as MAE or RMSE.

2.5 Netflix Prize and Dataset



The Netflix competition was held between October 2006 and July 2009, when BellKor's 
Pragmatic Chaos team won the million-dollar-prize. The goal of this competition was im- 
proving the prediction accuracy (in terms of RMSE) of the Netflix movie recommendation 
system by 10%. The winning team used a hybrid method that used temporal dynamics
to account for dates in which ratings were reported [IHl E]. The second-place team, The
Ensemble [33], achieved comparable performance to the winning team by linearly combining
a large number of models.

Although the Netflix competition has finished, the dataset used in that competition is
still used as a standard dataset for evaluating CF methods. It has 480,046 users and 17,770
items with 95,947,878 ratings. This represents a sparsity level of 1.12% (number of observed
entries divided by total number of entries). Older and smaller standard datasets include MovieLens
(6,040 users, 3,500 items with 1,000,000 ratings), EachMovie (72,916 users, 1,628 items with 
2,811,983 ratings), and BookCrossing (278,858 users, 271,379 items with 1,149,780 ratings). 
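
The quoted level can be checked directly from the counts above: it is the number of observed ratings divided by the number of cells in the user-by-item matrix. The helper name below is hypothetical.

```python
def density(n_users, n_items, n_ratings):
    """Fraction of the user-by-item matrix that holds an observed rating."""
    return n_ratings / (n_users * n_items)

# Counts as quoted in the text above.
netflix = density(480_046, 17_770, 95_947_878)
movielens = density(6_040, 3_500, 1_000_000)
bookcrossing = density(278_858, 271_379, 1_149_780)
```

With these counts the Netflix matrix has 8,530,417,420 cells, giving roughly 1.12%. By the same computation MovieLens is considerably denser and BookCrossing is far sparser.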

3 Experimental Study 

We describe below some details concerning our experimental study, and then follow with a 
description of our major findings. 

3.1 Experimental Design 

To conduct our experiments and to facilitate their reproducibility we implemented the
PREA toolkit, which implements the 15 algorithms listed in Table 1. The toolkit is available
for public usage and will be updated with additional state-of-the-art algorithms proposed by
the research community.

There are three elementary baselines in Table 1: a constant function (identical prediction
for all users and all items), user average (constant prediction for each user, based on their
average rating), and item average (constant prediction for each item, based on its average
rating). The memory-based methods listed in Table 1 are classical methods that perform
well and are often used in commercial settings. The methods listed under the matrix
factorization and others categories are more recent state-of-the-art methods proposed in the
research literature.
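
The three baselines are simple enough to sketch in full. The code below is an illustration only: it assumes that the constant predictor returns the global mean of the training ratings and that users or items unseen in training fall back to that mean. Neither choice is specified in the text, and the function names are hypothetical.

```python
def fit_baselines(train):
    """train: list of (user, item, rating) triples.
    Returns three predictors, each called as predict(user, item)."""
    global_mean = sum(r for _, _, r in train) / len(train)
    by_user, by_item = {}, {}
    for u, i, r in train:
        by_user.setdefault(u, []).append(r)
        by_item.setdefault(i, []).append(r)
    user_mean = {u: sum(v) / len(v) for u, v in by_user.items()}
    item_mean = {i: sum(v) / len(v) for i, v in by_item.items()}

    def constant(user, item):
        return global_mean  # assumption: the constant is the global mean

    def user_average(user, item):
        return user_mean.get(user, global_mean)  # fallback is an assumption

    def item_average(user, item):
        return item_mean.get(item, global_mean)  # fallback is an assumption

    return constant, user_average, item_average

# Illustrative training triples.
train = [("u1", "a", 5), ("u1", "b", 3), ("u2", "a", 1)]
constant, user_average, item_average = fit_baselines(train)
```

None of the three predictors uses any relationship between users or between items, which is why they serve as reference points for the other methods.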

In our experiments we used the Netflix dataset, a standard benchmark in the CF literature
that is larger and more recent than alternative benchmarks. To facilitate measuring the
dependency between prediction accuracy and dataset size and density, we sorted the rating
matrix so that its rows and columns are listed in order of descending density level. We then
realized a specific sparsity pattern by selecting the top k rows and l columns and subsampling
to achieve the required sparsity.

http://www.netflixprize.com
http://prea.gatech.edu

Category                     Algorithms
---------------------------  -------------------------
Baseline                     Constant
                             User Average
                             Item Average
Memory-based                 User-based [3B]
                             User-based w/ Default [5]
                             Item-based [31]
                             Item-based w/ Default
Matrix Factorization-based   Regularized SVD
                             NMF [19]
                             PMF [30]
                             Bayesian PMF [29]
                             Non-linear PMF [H]
Others                       Slope-One [20]
                             NPCA [41]
                             Rank-based CF [37]

Table 1: List of Recommendation Algorithms used in Experiments

Figure 1: Rating density (cumulative in the left panel and non-cumulative in the right panel)
in the Netflix rating matrix, sorted by descending density of rows and columns. See text for more
detail.

Figure 1 shows the density level of the sorted rating matrix. For instance, the top right
corner of the sorted rating matrix containing the top 5,000 users and top 2,000 items has
a density of 52.6%. In other words, 47.4% of the ratings are missing. The density of
the entire dataset is around 1%. We subsample a prescribed level of density which will be
used for training as well as 20% more for the purpose of testing. We cross validate each
experiment 10 times with different train-test splits. The experiments were conducted on a 
dual Intel Xeon X5650 processor (6 Core, 12 Threads, 2.66GHz) with 96GB of main memory. 
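
The data preparation described in this section can be sketched as follows. This is an illustration under assumptions, not the toolkit's code: "density level" of a row or column is taken to be its number of observed ratings, "20% more" is read as a test set one fifth the size of the training set, and all function names and the toy data are hypothetical.

```python
import random

def densest_submatrix(ratings, k, l):
    """Keep ratings whose user is among the k most active users and whose
    item is among the l most rated items."""
    user_count, item_count = {}, {}
    for u, i, _ in ratings:
        user_count[u] = user_count.get(u, 0) + 1
        item_count[i] = item_count.get(i, 0) + 1
    top_users = set(sorted(user_count, key=user_count.get, reverse=True)[:k])
    top_items = set(sorted(item_count, key=item_count.get, reverse=True)[:l])
    return [(u, i, r) for u, i, r in ratings if u in top_users and i in top_items]

def sample_train_test(ratings, n_users, n_items, density,
                      test_fraction=0.2, seed=0):
    """Draw a training set at the requested density plus a disjoint test set."""
    n_train = int(density * n_users * n_items)
    n_test = int(test_fraction * n_train)
    if n_train + n_test > len(ratings):
        raise ValueError("not enough observed ratings for the requested density")
    picked = random.Random(seed).sample(ratings, n_train + n_test)
    return picked[:n_train], picked[n_train:]

# Toy triangular data: user u rated items 0 .. 5 - u (illustrative only).
ratings = [(u, i, 3) for u in range(6) for i in range(6 - u)]
sub = densest_submatrix(ratings, k=4, l=4)
train, test = sample_train_test(sub, n_users=4, n_items=4, density=0.625)
```

Repeating the last call with ten different seeds would give ten train-test splits in the spirit of the repeated validation described above.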

3.2 Dependency on Data Size and Density 

We start by investigating the dependency of prediction accuracy on the dataset size (number
of users and number of items) and on the rating density. Of particular interest is the
variability in that dependency across different CF algorithms. This variability holds the
key to determining which CF algorithms should be used in a specific situation. We start
below by considering the univariate dependency of prediction accuracy on each of these three
quantities: number of users, number of items, and density level. We then conclude with an
investigation of the multivariate dependency between the prediction accuracy and these three
variables.

Figure 2: Prediction loss as a function of user count (top), item count (middle), and density
(bottom).

3.2.1 Dependency on User Count 

Figure 2 (top row) graphs the dependency of mean absolute error (MAE) on the number of
users with each of the three panels focusing on CF methods in a specific category
(memory-based and baselines, model-based, and other). The item count and density were fixed at
2,000 and 3%, respectively. The RMSE evaluation measure shows a very similar trend.

We omitted the simplest baseline of constant prediction rule, since its performance is
much worse than the others. The default voting variants of the user-based and item-based
CF methods did not produce noticeable changes (we graphed the variant which worked the
best).

To compare the way in which the different algorithms depend on the user count,
we fitted a linear regression model to the curves in Figure 2 (top row). The slope m and
intercept b regression coefficients appear in Table 2. The intercept b indicates the algorithm's
expected MAE loss when the number of users approaches 0. The slope m indicates the rate
of decay of the MAE.
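
The slope and intercept of such a fit follow from the usual closed-form least-squares formulas. The sketch below is illustrative: the data points are synthetic values placed exactly on a line and are not measurements from this study.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m * x + b; returns (m, b)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    m = sxy / sxx
    b = mean_y - m * mean_x
    return m, b

# Synthetic example: MAE falling by 4e-6 per additional user from 0.78.
user_counts = [1000, 2000, 4000, 8000]
maes = [0.78 - 4e-6 * x for x in user_counts]
m, b = fit_line(user_counts, maes)
```

A more negative `m` means the MAE drops faster as users are added, and `b` is the extrapolated MAE at zero users, matching the reading of the two coefficients given above.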

Looking at Figure 2 and Table 2 we make the following observations.

1. Matrix factorization methods in general show better performance when the number of 
users gets sufficiently large (> 3,000). 

2. Overall, the best performing algorithm is regularized SVD. 

3. When the number of users is sufficiently small there is very little difference between 
matrix factorization methods and the simpler neighborhood-based methods. 

4. Item average, item-based, regularized SVD, PMF, BPMF, and NLPMF tend to be the 
most sensitive to variation in user count. 

5. Constant baseline, user average, user-based, NMF, NPCA, and rank-based are rela- 
tively insensitive to the number of users. 

6. There is a stark difference in sensitivity between the two popular neighborhood-based
methods: user-based CF and item-based CF.

7. User-based CF is extremely effective for low user count but has an almost constant 
dependency on the user count. Item-based CF performs considerably worse at first, 
but outperforms all other memory-based methods for larger user count. 



3.2.2 Dependency on Item Count

In analogy with Section 3.2.1 we investigate here the dependency of the prediction loss
on the number of items, fixing the user count and density at 5,000 and 3%, respectively.
Figure 2 (middle row) shows the MAE as a function of the number of items for three different
categories of CF algorithms. Table 2 shows the regression coefficients (see description in
Section 3.2.1).

Looking at Figure 2 and Table 2 we make the following observations that are largely in
agreement with the observations in Section 3.2.1.



1. Matrix factorization methods in general show better performance when the number of 
items gets sufficiently large (> 1,000). 

2. Overall, the best performing algorithm is regularized SVD. 

3. When the number of items is sufficiently small there is very little difference between
matrix factorization methods and the simpler neighborhood-based methods.

Algorithm                      User Count            Item Count            Density
                               10^-6 m    b          10^-6 m    b          m         b
Constant                       -1.7636    0.9187     -13.8097   0.9501     -0.0691   0.9085
User Average                   -2.5782    0.8048      -3.5818   0.8006     -0.6010   0.8094
Item Average                   -6.6879    0.8593      +1.1927   0.8155     -0.2337   0.8241
User-based                     -3.9206    0.7798     -18.1648   0.8067     -4.7816   0.9269
User-based (Default values)    -3.6376    0.7760     -19.3139   0.8081     -4.7081   0.9228
Item-based                    -10.0739    0.8244      -3.2230   0.7656     -4.7104   0.9255
Item-based (Default values)   -10.4424    0.8271      -3.6473   0.7670     -4.9147   0.9332
Slope-one                      -5.6624    0.7586      -7.5467   0.7443     -5.1465   0.9112
Regularized SVD                -7.6176    0.7526     -14.1455   0.7407     -2.2964   0.7814
NMF                            -4.4170    0.7594      -5.7830   0.7481     -0.9792   0.7652
PMF                            -6.9000    0.7531     -14.7345   0.7529     -6.3705   0.9364
Bayesian PMF                  -11.0558    0.7895     -23.7406   0.7824     -4.9316   0.8905
Non-linear PMF                 -8.8012    0.7664     -14.7588   0.7532     -2.8411   0.8135
NPCA                           -4.1497    0.7898      -7.2994   0.7910     -3.5036   0.8850
Rank-based CF                  -3.8024    0.7627      -7.3261   0.7715     -2.4686   0.8246

Table 2: Regression coefficients y = mx + b for the curves in Figure 2. The variable x
represents user count, item count, or density, and the variable y represents the MAE prediction
loss.



4. User-based, regularized SVD, PMF, BPMF, and NLPMF tend to be the most sensitive 
to variation in item count. 

5. Item average, item-based, and NMF tend to be less sensitive to variation in item count. 

6. There is a stark difference in sensitivity between the two popular neighborhood-based
methods: user-based CF and item-based CF.

7. Item-based CF is extremely effective for low item count but has an almost constant
dependency on the item count. User-based CF performs considerably worse at first,
but outperforms all other memory-based methods significantly for larger item count.



Combined with the observations in Section 3.2.1, we conclude that slope-one, NMF, 
and NPCA are insensitive to variations in both user and item count. PMF and BPMF 
are relatively sensitive to variations in both user and item count. 



3.2.3 Dependency on Density 



Figure 2 (bottom row) graphs the dependency of the MAE loss on the rating density, fixing
user and item count at 5,000 and 2,000. As in Section 3.2.1, Table 2 displays regression
coefficients corresponding to performance at near-zero density and sensitivity of the MAE
function to the density level.

Looking at Figure 2 (bottom row) and Table 2 we make the following observations.

1. The simple baselines (user average and item average) work remarkably well for low
density.

2. The best performing algorithm seems to be regularized SVD.

3. User-based CF and item-based CF show a remarkably similar dependency on density
level. This is in stark contrast to their different dependencies on the user and item
count.

4. As the density level increases the differences in prediction accuracy of the different
algorithms shrink.

5. User-based, item-based, slope-one, PMF, and BPMF are largely dependent on density.

6. The three baselines and NMF are relatively independent from density.

7. The performance of slope-one and PMF degrades significantly at low densities,
performing worse than the weakest baselines. Nevertheless both algorithms feature outstanding
performance at high densities.

3.2.4 Multivariate Dependencies between Prediction Loss, User Count, Item
Count, and Density

The univariate dependencies examined previously show important trends but are limited
since they examine variability of one quantity while fixing the other quantities to arbitrary
values. We now turn to examine the dependency between prediction loss and the following
variables: user count, item count, and density. We do so by graphing the MAE as a function
of user count, item count, and density (Figures 3-4) and by fitting multivariate regression
models to the dependency of MAE on user count, item count, and density (Table 3).

Figures 3-4 show the equal height contours of the MAE as a function of user count (x
axis) and item count (y axis) for multiple density levels (horizontal panels) and for multiple
CF algorithms (vertical panels). Note that all contour plots share the same x and y axes
scales and so are directly comparable. Intervals between different contour lines represent a
difference of 0.01 in MAE and so more contour lines represent higher dependency on the x
and y axis. Analogous RMSE graphs show similar trends to these MAE graphs.

Table 3 displays the regression coefficients $m_u$, $m_i$, $m_d$ corresponding to the linear model

\[ y = m_u x_u + m_i x_i + m_d x_d + b \qquad (1) \]

where $x_u$, $x_i$, and $x_d$ indicate user count, item count, and density, $b$ is the constant term,
and $y$ is the MAE.
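
A fit of model (1) can be sketched with the normal equations. The code below rescales the inputs by the constants reported with Table 3 (10,000 users, 3,000 items, density 0.05) so that the coefficients are comparable in magnitude. Everything else is an assumption made for illustration: the function names are hypothetical, the solver is a plain Gaussian elimination, and the samples are synthetic values generated from known coefficients rather than measurements from this study.

```python
def solve(A, y):
    """Solve the square system A w = y by Gaussian elimination with pivoting."""
    n = len(A)
    M = [row[:] + [y[r]] for r, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for j in range(c, n + 1):
                M[r][j] -= f * M[c][j]
    w = [0.0] * n
    for r in range(n - 1, -1, -1):
        w[r] = (M[r][n] - sum(M[r][j] * w[j] for j in range(r + 1, n))) / M[r][r]
    return w

def fit_mae_model(samples):
    """samples: (user_count, item_count, density, mae) tuples.
    Returns (m_u, m_i, m_d, b) for the rescaled inputs."""
    X = [[xu / 10000, xi / 3000, xd / 0.05, 1.0] for xu, xi, xd, _ in samples]
    y = [s[3] for s in samples]
    XtX = [[sum(row[a] * row[c] for row in X) for c in range(4)] for a in range(4)]
    Xty = [sum(row[a] * t for row, t in zip(X, y)) for a in range(4)]
    return tuple(solve(XtX, Xty))

# Synthetic grid generated from known coefficients (illustrative only).
true_coef = (-0.05, -0.03, -0.10, 0.90)
samples = [(xu, xi, xd,
            true_coef[0] * xu / 10000 + true_coef[1] * xi / 3000
            + true_coef[2] * xd / 0.05 + true_coef[3])
           for xu in (2000, 6000, 10000)
           for xi in (1000, 3000)
           for xd in (0.01, 0.05)]
coef = fit_mae_model(samples)
```

Because the inputs are rescaled, the absolute values of the first three coefficients can be ranked against each other, which is how the ranks in Table 3 are read.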

Based on Figures 3-4 and Table 3 we make the following observations.

1. The univariate relationships discovered in the previous section for fixed values of the
remaining two variables do not necessarily hold in general. For example, the conclusion
that PMF is relatively sensitive to user count and item count at the 1% and 3% density
levels is no longer valid at the 5% density level. It is thus important
to conduct a multivariate, rather than univariate, analysis of the dependency of the
prediction loss on the problem parameters.

2. The shape of the MAE contour curves varies from algorithm to algorithm. For example,
the contour curves of constant, user average, user-based, and user-based with default
values are horizontal, implying that these algorithms depend largely on the number
of items, regardless of user count. On the other hand, item average, item-based, and
item-based with default values show vertical contour lines, showing a high dependency
on the user count.

3. Higher dependency on user count and item count is correlated with high dependency
on density.

4. The last column of Table 3 summarizes dependency trends based on the absolute value
of the regression coefficients and their rank. Generally speaking, baselines are relatively
insensitive, memory-based methods are dependent on one variable (opposite to their
names), and matrix factorization methods are highly dependent on both dataset size
and density.

Figure 3: MAE contours for simple methods (lower values mean better performance).



Algorithm                 m_u            m_i            m_d            b       Summary
Constant                  (15) -0.0115   (8)  -0.0577   (15) -0.0002   0.9600  Weakly dependent on all variables.
User Average              (13) -0.0185   (14) -0.0164   (13) -0.0188   0.8265  Weakly dependent on all variables.
Item Average              (8)  -0.0488   (15) +0.0003   (14) -0.0078   0.8530  Weakly dependent on all variables.
User-based                (11) -0.0282   (3)  -0.0704   (1)  -0.2310   0.9953  Weakly dependent on user count.
User-based (w/Default)    (12) -0.0260   (7)  -0.0598   (5)  -0.2185   0.9745  Weakly dependent on user count.
Item-based                (3)  -0.0630   (12) -0.0172   (4)  -0.2201   0.9688  Weakly dependent on item count.
Item-based (w/Default)    (2)  -0.0632   (13) -0.0167   (2)  -0.2286   0.9751  Weakly dependent on item count.
Slope-one                 (1)  -0.0746   (4)  -0.0702   (8)  -0.1421   0.9291  Strongly dependent on all variables.
Regularized SVD           (7)  -0.0513   (9)  -0.0507   (11) -0.0910   0.8371  Strongly dependent on dataset size.
NMF                       (9)  -0.0317   (10) -0.0283   (12) -0.0341   0.7971  Weakly dependent on all variables.
PMF                       (5)  -0.0620   (2)  -0.1113   (3)  -0.2269   0.9980  Strongly dependent on all variables.
Bayesian PMF (BPMF)       (4)  -0.0628   (1)  -0.1126   (6)  -0.1999   0.9817  Strongly dependent on all variables.
Non-linear PMF (NLPMF)    (6)  -0.0611   (6)  -0.0599   (9)  -0.1165   0.8786  Strongly dependent on dataset size.
NPCA                      (14) -0.0184   (11) -0.0213   (7)  -0.1577   0.9103  Weakly dependent on dataset size.
Rank-based CF             (10) -0.0295   (5)  -0.0687   (10) -0.1065   0.8302  Strongly dependent on item count.

Table 3: Regression coefficients for the model $y = m_u z_u + m_i z_i + m_d z_d + b$ (1), where y is
MAE, and $z_u$, $z_i$, and $z_d$ are inputs from $x_u$, $x_i$, and $x_d$, normalized to achieve similar scales:
$z_u = x_u / 10{,}000$, $z_i = x_i / 3{,}000$, and $z_d = x_d / 0.05$. The rank on each variable is indicated in
parentheses, with rank 1 showing highest dependency.



3.3 Accuracy Comparison 

Figure 5 shows the best performing algorithm (in terms of MAE) as a function of user count,
item count, and density. We make the following conclusions.

1. The identity of the best performing algorithm is non-linearly dependent on user
count, item count, and density.

2. NMF is dominant in low density cases while BPMF works well for high density cases
(especially for high item and user count).

3. Regularized SVD and PMF perform well for density levels 2%-4%.

Analogous RMSE graphs show similar trends with regularized SVD outperforming other
algorithms in most regions.

Figure 5: Best-performing algorithms in MAE for given user count, item count, and density.

3.4 Asymmetric and Rank-based Metrics 

We consider here the effect of replacing the MAE or RMSE with other loss functions, specif- 
ically with asymmetric loss and with rank-based loss. 

Asymmetric loss is motivated by the fact that recommending an undesirable item is
worse than failing to recommend a desirable item. In other words, the loss function L(a, b),
measuring the effect of predicting rating b when the true rating is a, is an asymmetric function.
Specifically, we consider the loss function L(a, b) defined by the following matrix (rows and
columns express number of stars on a 1-5 scale)











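\[
L(a, b) =
\begin{pmatrix}
0 & 0 & 0 & 7.5 & 10 \\
0 & 0 & 0 & 4 & 6 \\
0 & 0 & 0 & 1.5 & 3 \\
3 & 2 & 1 & 0 & 0 \\
4 & 3 & 2 & 0 & 0
\end{pmatrix}
\qquad (2)
\]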



This loss function represents two beliefs: 1) Differences among items to be recommended are
not important. Assuming that we issue recommendations with rating 4 or 5, no loss is given
between the two. In the same way, we do not penalize errors among items which will not
be recommended. 2) We give a more severe penalty for recommending bad items than for missing
potentially preferable items. For the latter case, the loss is the exact difference between the
prediction and ground truth. For the former case, however, we give a higher penalty. For
example, the penalty is 10 for predicting the worst item with true score 1 as score 5, higher than
the penalty of 4 for the opposite error. In many practical cases involving recommendation
systems, asymmetric loss functions provide a more realistic loss function than symmetric
loss functions such as MAE or RMSE.

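
The asymmetric loss amounts to a table lookup. In the sketch below, rows are true ratings and columns are predicted ratings; the zero entries follow the two beliefs stated above (no loss within the recommended group of 4-5 stars or within the non-recommended group of 1-3 stars). Rounding a real-valued prediction to the nearest star before the lookup is an assumption made here for illustration, and the function names are hypothetical.

```python
# LOSS[a - 1][b - 1] is the penalty for predicting b stars when the truth is a.
LOSS = [
    [0, 0, 0, 7.5, 10],
    [0, 0, 0, 4, 6],
    [0, 0, 0, 1.5, 3],
    [3, 2, 1, 0, 0],
    [4, 3, 2, 0, 0],
]

def to_star(value):
    """Round a real-valued prediction to a whole star in 1..5 (assumption)."""
    return min(5, max(1, int(value + 0.5)))

def asymmetric_loss(true_rating, predicted_rating):
    """Look up the penalty for one (true, predicted) pair of star ratings."""
    return LOSS[true_rating - 1][to_star(predicted_rating) - 1]

def mean_asymmetric_loss(pairs):
    """Average penalty over (true, predicted) pairs from a test set."""
    return sum(asymmetric_loss(a, b) for a, b in pairs) / len(pairs)

# Illustrative test pairs: (true rating, predicted rating).
pairs = [(1, 5), (5, 1), (3, 3), (2, 4)]
average = mean_asymmetric_loss(pairs)
```

Recommending a one-star item as a five-star item costs 10, while the opposite mistake costs 4, which is the asymmetry described in the text.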
Rank-based loss functions are based on evaluating a ranked list of recommended items
presented to a user. The evaluation of the list gives higher importance to good
recommendations at the top of the list than at the bottom of the list. One specific formula, called
half-life utility (HLU) [5] [12], assumes an exponential decay in the list position. Formally,
the utility function associated with a user u is

\[ R_u = \sum_{i=1}^{N} \frac{\max(r_{u,i} - d, 0)}{2^{(i-1)/(\alpha-1)}} \qquad (3) \]

where $N$ is the number of recommended items (length of the list), $r_{u,i}$ is the rating of user u
for item i in the list, and $d$ and $\alpha$ are constants, set to $d = 3$ and $\alpha = 5$ (we assume $N = 10$).
The final utility is divided by the maximum possible utility for the user and averaged
over all test users [5]. Alternative rank-based evaluations are based on NDCG [H],
Kendall's Tau, and Spearman's Rank Correlation Coefficient [21].
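
A sketch of the half-life utility computation follows, using d = 3, a half-life parameter of 5, and a list length of 10 as stated above. The input is the user's true ratings listed in the order in which the recommender ranked the items. Taking the maximum possible utility to be the utility of the same ratings sorted in descending order is an assumption made for illustration, and the function names are hypothetical.

```python
def half_life_utility(ranked_ratings, d=3, alpha=5):
    """Sum of max(r - d, 0) discounted by 2 ** ((position - 1) / (alpha - 1))."""
    return sum(max(r - d, 0) / 2 ** ((i - 1) / (alpha - 1))
               for i, r in enumerate(ranked_ratings, start=1))

def normalized_hlu(ranked_ratings, n=10, d=3, alpha=5):
    """Utility of the top-n list divided by the best achievable utility."""
    achieved = half_life_utility(ranked_ratings[:n], d, alpha)
    ideal = sorted(ranked_ratings, reverse=True)[:n]  # assumption: ideal order
    best = half_life_utility(ideal, d, alpha)
    return 0.0 if best == 0 else achieved / best

def mean_hlu(lists_by_user, n=10, d=3, alpha=5):
    """Average the normalized utility over all test users."""
    scores = [normalized_hlu(r, n, d, alpha) for r in lists_by_user.values()]
    return sum(scores) / len(scores)

# Illustrative lists: true ratings in the order the system ranked the items.
lists_by_user = {"u1": [5, 4, 3], "u2": [3, 4, 5]}
score = mean_hlu(lists_by_user)
```

Only ratings above d contribute, and an item at position 5 counts half as much as the same item at position 1, which is what makes the measure sensitive to the top of the list.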



3.4.1 Asymmetric Loss 

Figures 6-7 show equal level contour plots of the asymmetric loss function (2), as a function
of user count, item count, and density level. We make the following observations.

1. The shape and density pattern of the contour lines differ from the shape of the contour 
lines in the case of the MAE. 

2. In general, regularized SVD outperforms all other algorithms. Other matrix factor- 
ization methods (PMF, BPMF, and NLPMF) perform relatively well for dense data. 
With sparse data, NMF performs well. 



3.4.2 Rank-based Evaluation Measures 

Figures 8 and 9 show equal-level contours of the HLU function (3). Figure 10 shows the
best-performing algorithm for different user count, item count, and density. We make the
following observations. 

1. The contour lines are generally horizontal, indicating that performance under HLU 
depends largely on the number of items and is less affected by the number of users. 

2. The HLU score is highly sensitive to the dataset density. 



3. Regularized SVD outperforms other methods (see Figure 10) in most settings. The 
simple baseline item average is best for small and sparse datasets. A similar comment 
can be made regarding NPCA. NMF and slope-one perform well for sparse data, though 
they lag somewhat behind the previously mentioned algorithms. 

Other rank-based loss functions based on NDCG, Kendall's Tau, and Spearman show 
similar trends. 
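For reference, NDCG can be computed along the following lines. This is a sketch under stated assumptions: it uses the common exponential gain (2^rating - 1) and a log2(position + 1) discount, which may differ from the exact variant used in the experiments, and the function names are hypothetical.

```python
import math


def dcg(ranked_ratings):
    """Discounted cumulative gain of true ratings listed in ranked order."""
    return sum(
        # i is 0-based, so the discount log2(position + 1) becomes log2(i + 2)
        (2 ** r - 1) / math.log2(i + 2)
        for i, r in enumerate(ranked_ratings)
    )


def ndcg(ranked_ratings):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(ranked_ratings, reverse=True))
    return dcg(ranked_ratings) / ideal if ideal > 0 else 0.0
```

Like HLU, this measure rewards placing highly rated items near the top of the list, but it discounts logarithmically rather than exponentially in the list position.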
Figure 6: Asymmetric Loss Contours for simple methods (Lower values mean better performance.) 

Figure 8: Half-Life Utility Contours for simple methods (Higher values mean better performance.) 

Figure 10: Best-performing algorithms in HLU for given user count, item count, and density. 



3.5 Computational Considerations 

As Figure 11 shows, the computation time varies significantly between different algorithms. 
It is therefore important to consider computational issues when deciding on the appropriate 
CF algorithm. We consider three distinct scenarios, listed below. 



• Unlimited Time Resources: We assume in this case that we can afford arbitrarily 
long computation time. This scenario is realistic in some cases involving a static training 
set, making offline computation feasible. 

• Constrained Time Resources: We assume in this case some mild constraints on 
the computation time. This scenario is realistic in cases where the training set is 
periodically updated, necessitating periodic re-training with updated data. We assume 
here that the training phase should complete within an hour or so. Since practical datasets 
like the full Netflix set are much bigger than the subsampled one in our experiments, we 
use much shorter time limits: 5 minutes and 1 minute. 

• Real-time Applications: We assume in this case severe constraints on the 
computation time. This scenario is realistic in cases where the training set changes 
continuously. We assume here that the training phase should not exceed several seconds. 



Figures 5 and 12 show the best-performing CF algorithm (in terms of MAE) under several 
different time constraints, as a function of the user count, item count, and density. 
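The selection behind such a best-performing map can be sketched as follows: within one (user count, item count, density) cell, discard the algorithms whose computation time exceeds the budget and keep the one with the lowest MAE. The function name is hypothetical and the measurements in `example` are made-up placeholders, not results from this study.

```python
def best_under_budget(results, budget_seconds=None):
    """Pick the lowest-MAE algorithm whose time fits the budget.

    results maps an algorithm name to a (mae, seconds) pair.
    budget_seconds=None means unlimited time. Returns None if nothing fits.
    """
    feasible = {
        name: mae
        for name, (mae, seconds) in results.items()
        if budget_seconds is None or seconds <= budget_seconds
    }
    if not feasible:
        return None
    return min(feasible, key=feasible.get)


# Placeholder numbers for a single cell, for illustration only.
example = {
    "Regularized SVD": (0.72, 900.0),
    "PMF": (0.74, 120.0),
    "Slope-one": (0.78, 20.0),
    "User Average": (0.85, 0.5),
}
```

Calling `best_under_budget(example, 300)` and `best_under_budget(example, 60)` corresponds to the 5 minute and 1 minute limits; a real-time budget of a few seconds would use a smaller value still.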

We make the following observations. 

1. When there are no computation constraints, the conclusions from the previous sections 
apply. Specifically, NMF performs best for sparse datasets, BPMF performs best for 
dense datasets, and regularized SVD and PMF perform the best otherwise (PMF works 
well with smaller user counts while regularized SVD works well with smaller item counts). 

2. When the time constraint is 5 minutes, regularized SVD, NLPMF, NPCA, and Rank-based 
CF (the ones colored darkly in Figure 11) are not considered. In this setting 
NMF works best for sparse data, BPMF works best for dense and large data, and PMF 
works best otherwise. 

3. When the time constraint is 1 minute, PMF and BPMF are additionally excluded 
from consideration. Slope-one works best in most cases, except for the sparsest data 
where NMF works best. 

4. In cases requiring real-time computation, the user average is the best algorithm, except 
for a small region where item average is preferred. 

Figure 11: Computation time (Train time + Test time) for each algorithm. The legend on the 
right indicates the relation between color scheme and computation time. Time constraints (5 
minutes and 1 minute) used in this article are marked as well. User count increases from left 
to right, and item count increases from bottom to top in each cell. (Same way to Figure [sl-lij) 

Figure 12: Best-performing algorithms with varied constraints. 
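The four observations above can be condensed into a small lookup, shown here as a rough sketch. The labels for the time budget and the density regime are hypothetical; the text gives no numeric thresholds for "sparse" or "dense", so the caller is assumed to supply the regime.

```python
# Observations 1-4 as a table keyed by (time budget, density regime).
BEST_BY_SETTING = {
    ("unlimited", "sparse"): "NMF",
    ("unlimited", "dense"): "BPMF",
    ("unlimited", "other"): "Regularized SVD or PMF",
    ("5min", "sparse"): "NMF",
    ("5min", "dense"): "BPMF",       # stated for dense and large data
    ("5min", "other"): "PMF",
    ("1min", "sparse"): "NMF",       # stated for the sparsest data only
    ("1min", "dense"): "Slope-one",
    ("1min", "other"): "Slope-one",
}


def suggest_algorithm(time_budget, density):
    """Return the algorithm the observations favor for the given setting."""
    if time_budget == "realtime":
        # Item average is preferred in a small region; that exception
        # is ignored in this simplified lookup.
        return "User Average"
    try:
        return BEST_BY_SETTING[(time_budget, density)]
    except KeyError:
        raise ValueError(f"unknown setting: {time_budget!r}, {density!r}") from None
```

The table is only a coarse summary in terms of MAE; the figures remain the reference for boundary cases between regimes.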



4 Discussion 



In addition to the conclusions stated in Section 3, we have identified seven groups of CF 
methods, where CF methods in the same group share certain experimental properties: 

• Baselines: Constant, User Average, Item Average 

• Memory-based methods: User-based, Item-based (with and without default values) 

• Matrix-Factorization I: Regularized SVD, PMF, BPMF, NLPMF 

• Matrix-Factorization II: NMF 

• Others (Individually): Slope-one, NPCA, Rank-based CF. 
Table 4 displays for each of these groups the dependency, accuracy, computational cost, 
and pros and cons. 

We repeat below some of the major conclusions. See Section 3 for more details and 
additional conclusions. 

• Matrix-Factorization-based methods generally have the highest accuracy. Specifically, 
regularized SVD, PMF and its variations perform best in terms of MAE and RMSE, 
except in very sparse situations, where NMF performs the best. Matrix-factorization 
methods also perform well in terms of the asymmetric cost and rank-based evaluation 
measures. NPCA and rank-based CF work well in these cases as well. The Slope-one 
method performs well and is computationally efficient. Memory-based methods, 
however, do not have special merit other than simplicity. 

• All algorithms vary in their accuracy, based on the user count, item count, and density. 
The strength and nature of the dependency, however, varies from algorithm to 
algorithm, and bivariate relationships change when different values are assigned to the 
third variable. In general, high dependence on the user count and item count is 
correlated with high dependence on density, which appears to be the more influential 
factor. 

• There is a trade-off between better accuracy and other factors such as low variance in 
accuracy, computational efficiency, memory consumption, and a smaller number of 
adjustable parameters. That is, the more accurate algorithms tend to depend highly 
on dataset size and density, to have higher variance in accuracy, to be less computationally 
efficient, and to have more adjustable parameters. A careful examination 
of the experimental results can help resolve this trade-off in a manner that is specific 
to the situation at hand. For example, when computational efficiency is less important, 
Matrix-Factorization methods are the most appropriate, and when computational 
efficiency is important, slope-one could be a better choice. 

This experimental study, accompanied by open source software that allows reproducing 
our experiments, sheds light on how CF algorithms compare to each other, and 
on their dependency on the problem parameters. The conclusions described above should 
help practitioners implementing recommendation systems and researchers examining novel 
state-of-the-art CF methods. 